ABSTRACT

Automatic suggestion of alternative terms to refine a user’s query
is an effective technique to help the user quickly narrow down to
his or her specific information need. However, evaluating the
effectiveness of these suggestions has remained quite subjective, with
the vast majority of past work relying on expensive user studies.

In  this  work,  we  look  at  this  problem  from  the  IR  perspective.  We 
propose  two  objective  measures  that  evaluate  the  quality  of  Query 
Refinement  (QR)  suggestions,  based  on  the  degree  to  which  the 
documents  retrieved  by  the  QR  suggestions,  when  used  as  queries, 
capture  the  overall  sub-topical  structure  underlying  the  topic  of  the 
original query. The first measure, known as Maximum Matching
Averaged Mean Average Precision (MM-AMAP), requires labeled
documents for the sub-topics underlying the query’s topic. The
second measure, which we call Distinctness and MAP based F1
(DMAP-F1), requires only labeled documents that are relevant to
the original query.

We also define a series of simple QR suggestion techniques, each
of which is intuitively better than the previous ones, and evaluate
them using our measures on the TDT3 and TDT4 corpora. Our
experiments show that our evaluation metrics numerically capture our
intuitive expectations on performance, thus informally validating
our measures.

Further, we also show that the second metric, DMAP-F1, which
does not require sub-topic judgments, yields results consistent with
the first metric and is statistically highly correlated with it. This
allows us to perform extensive evaluations of the quality of QR
suggestion techniques on standard TREC collections in the future.
1.  INTRODUCTION

In today’s age of information explosion, it is no longer sufficient
for IR systems to merely present a ranked list of documents relevant
to a user’s query. Considering the amount of available information
and the complexity of information needs, it is not realistic
to expect the user to patiently sift through a traditional ranked list
that runs into thousands of documents for most queries. Hence, in
addition to the ranked list, the user needs further assistance from
the system in narrowing down the search and quickly discovering
the documents pertaining to his or her specific information need. As
Henninger and Belkin rightly put it: “Information retrieval systems
must not only provide efficient retrieval, but must also support the
user in describing a problem that s/he does not understand well.” [3]

Query Refinement (QR) suggestions, also called terminological
feedback in the literature, are an effective way to assist the user in
quickly locating the relevant documents. In this technique, the user
is presented with a few alternative suggestions for refining the
original query. The user is expected to choose one of the suggestions,
which is then appended to the original query, and a new list
of retrieved documents corresponding to the refined query is
presented to the user. The technique of QR suggestions is best
understood through the illustrative examples shown in table 1, which
we obtained from popular search engines. For example, when a
user types in the query “computer science”, the popular search
engine ask.com provides the user with several alternatives such as
“research”, “careers”, etc. If the user is a graduating student of
computer science and is looking for jobs, (s)he may choose the
suggestion “careers”, upon which the search engine issues the new
query “computer science careers” and presents a new set of results
that are more relevant to this specific need.

Note that QR suggestions are probably meaningful only for ‘topical’
queries like the ones cited in table 1. This technique may not
be applicable to a known-item finding query such as “Microsoft
research” or “MIT home page”. In the rest of this work, we
assume that we are dealing with topical queries for which providing
QR suggestions is meaningful. No assumption is, however, made
about the sub-topical structure underlying a query’s topic. The topic
could have a hierarchical structure instead of a flat one, but the QR
suggestions reveal only the sub-topical structure at the next level
each time. In the above example, if the sub-topic “jobs” has a
further topical structure beneath it such as “software”, “hardware”,
“administrative”, etc., this could be revealed by the new QR
suggestions when the user chooses the QR suggestion “jobs”, yielding
the query “computer science jobs”, at the first level. Thus, each time
a QR suggestion is chosen by the user, (s)he descends to the next
level in the topic hierarchy. This is similar to a tree search, making
it quick and efficient for the user to locate the specific sub-topic
that (s)he is interested in.

#  Query             Refinement suggestions                          Source
1  Computer Science  Research; Careers; Definition; Topics; Jobs;    www.ask.com
                     History; Projects
2  Diabetes          Treatment; Tests/diagnosis; Symptoms;           www.google.com
                     For patients; Causes/risk factors;
                     For health professionals; Alternative medicine
3  Java              virtual machine; applet; tutorial;              www.yahoo.com
                     applications; Indonesian; island; juice;
                     language

Table 1: Illustrative examples of query refinement suggestions

1.1  QR  suggestions  and  Query  expansion 

It is important to distinguish QR suggestions from automatic
query expansion [8, 9, 4], a technique that expands the
query with related words to boost retrieval performance. While QR
suggestions aim at narrowing down the focus of the user’s search
by presenting specific aspects of the original topic to the user, query
expansion aims at improving the overall recall of the relevant
documents in the ranked list. While in QR the user is expected to
choose one of the alternative refinement suggestions, query
expansion requires no user intervention. Additionally, query expansion
addresses the synonymy problem of queries well but is not as effective
in addressing the polysemy problem, while QR suggestions can
effectively address both problems. For example, query expansion
may be able to retrieve documents that contain automobile for a
query on cars, thus handling the problem of synonymy. But when
the query is polysemous, such as java, it may add both coffee and
programming as expansion terms to java. In contrast, QR
suggestions can distinguish between the coffee and programming aspects
of java by providing them as separate suggestions.

1.2  Objectives  and  Motivation 

In this work, we are primarily concerned with measuring the
quality of Query Refinement suggestions. Some work has been
done on evaluating QR suggestions, but mostly involving expensive
user studies. For example, Anick [1] used query logs of users
to demonstrate that QR suggestions can be useful to the users. He
also studied the effectiveness of QR suggestions by examining
certain user indicators such as the percentage of sessions ending in a
click in the ranked list, whether or not a QR suggestion is selected, etc.
However, these evaluation measures can be quite expensive since they
involve user interaction.

Another work on evaluation that has similar objectives to the
present work is that of Zhai et al. [10], which evaluates the ability
of an IR system to retrieve documents that cover many different
sub-topics under a given query’s topic. Their evaluation generalizes
the traditional precision and recall metrics by accounting for
intrinsic sub-topicality as well as redundancy in documents. That
work differs from ours in the subject of interest: while it
evaluates the ability of the ranked list to cover all sub-topics
within a query’s topic, we are interested in measuring the ability of
QR suggestions to cover all sub-topics of the query’s topic.

A closely related problem to sub-topic retrieval, sometimes called
“aspect retrieval”, is investigated in the interactive track of TREC,
where the purpose is to study how an interactive retrieval system
can best support a user in gathering information about different
aspects of a topic [7]. Again, this work consisted of user studies
rather than any objective evaluation metric.

As far as evaluation of the quality of QR suggestions is
concerned, we are not aware of any work that proposes an objective
evaluation metric that does not involve expensive and time-intensive
user studies. We believe an objective measure is vital for the
research community, not only for repeatability of experiments but
also for comparison of the various techniques proposed for QR
suggestions and for further development of newer techniques.

The  rest  of  the  paper  is  organized  as  follows:  We  present  our 
new  evaluation  metrics  in  section  2.  Section  3  lists  a  few  simple 
QR  suggestion  techniques  that  we  considered  for  evaluation  while 
section  4  presents  the  results  of  our  experiments.  In  section  5,  we 
present  some  discussion  on  the  limitations  of  the  new  measures  and 
map  out  directions  for  future  work. 

2.  EVALUATION  MEASURES 

In this work, we propose two objective measures for evaluating
the quality of QR suggestions. The first measure, known as
Maximum Matching Averaged Mean-Average-Precision (MM-AMAP),
requires labeled documents for the sub-topics underlying the query’s
topic. The second measure, which we call Distinctness and
Mean-Average-Precision based F1 (DMAP-F1), requires only labeled
documents that are relevant to the original query.

Our evaluation measures are based on the idea that QR
suggestions will be most effective when the suggestions reveal the
underlying sub-topical structure of the query’s topic. For example, for
the example query “Computer Science” in table 1, the QR
suggestions “jobs”, “departments”, “research areas”, “companies” reveal
the sub-topic structure of the broader “computer science” topic.
The user can narrow down his or her search by choosing one QR
suggestion $S$, say “jobs”, from the set of QR suggestions $\mathcal{S}$, upon
which the system retrieves a set of documents $D_S$ by appending $S$
to the original query, as in “computer science jobs”.

Following  this  intuition,  the  key  idea  behind  our  evaluation  mea¬ 
sures  can  be  described  as  follows: 

The quality of the QR suggestions can be measured objectively
by their retrieval effectiveness, when used as queries, w.r.t. the
sub-topics they represent.

Thus, the optimality of a QR suggestion $S$ for a given query
$Q$ can be measured by the quality of the corresponding retrieval set
$D_S$ of the expanded query, quantified by Mean Average Precision
(MAP) w.r.t. the sub-topic $T_S$ represented by $S$. This is illustrated
graphically in figure 1.

Figure 1: Evaluating QR suggestions by using them as queries and measuring the IR effectiveness of the corresponding retrieved set of documents
w.r.t. respective sub-topics

Since we do not know a priori which QR suggestion the user
would click on, we assume the user has equal probability of
choosing each QR suggestion. As a simple evaluation measure, one
can use the expected retrieval effectiveness over all the
sub-topics as shown below:

$$\mathrm{AMAP} = \frac{1}{|\mathcal{S}|} \sum_{S \in \mathcal{S}} \mathrm{MAP}(D_S \mid T_S) \tag{1}$$

where $\mathrm{MAP}(D_S \mid T_S)$ indicates the mean average precision of the
retrieved document set $D_S$ w.r.t. the sub-topic of the suggestion
$S$. Thus, the new evaluation measure AMAP (Averaged MAP) is
the average of the MAPs of each QR suggestion w.r.t. its
corresponding sub-topic. To compute this measure, it is evident that we
need relevance judgments for the sub-topics underlying the query.
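The averaging in equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the document identifiers and the dictionary layout are hypothetical, and the per-suggestion score is computed as the standard average precision of a single ranked list, which is an assumption about how MAP is applied to one retrieval set.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def amap(retrieved: Dict[str, List[str]], subtopic_of: Dict[str, Set[str]]) -> float:
    """Equation (1): mean over suggestions S of the score of D_S w.r.t. T_S.

    Assumes the suggestion-to-sub-topic correspondence is already known.
    """
    scores = [average_precision(retrieved[s], subtopic_of[s]) for s in retrieved]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two suggestions with their ranked lists and judgments.
retrieved = {"jobs": ["d1", "d2", "d3"], "research": ["d4", "d5"]}
subtopic_of = {"jobs": {"d1", "d3"}, "research": {"d5"}}
score = amap(retrieved, subtopic_of)
```

The sketch takes the mapping from each suggestion to its sub-topic as given, which is exactly the assumption that the next subsection relaxes.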

2.1  Maximum  Matching  Averaged  MAP 

An assumption that the above evaluation measure makes is that
the correspondence between each QR suggestion $S$ and its
sub-topic $T_S$ is known. In reality, this information is not available.
Besides, the suggestions automatically generated by the system may
not even capture the exact sub-topic structure underlying the user’s
query. Complicating matters further is the fact that the number
of QR suggestions generated by the system, $|\mathcal{S}|$, may not be equal
to the actual number of sub-topics $|\mathcal{T}|$ underlying a given query.
We propose a greedy bipartite matching algorithm which computes
AMAP by assigning the retrieval sets $\mathcal{D} = \{D_{S_1}, \dots, D_{S_m}\}$
corresponding to the QR suggestions $\mathcal{S} = \{S_1, \dots, S_m\}$ to the true set of
sub-topics $\mathcal{T} = \{T_1, \dots, T_n\}$.

The algorithm is described in table 2, graphically illustrated
in figure 2, and works as follows. We first define a complete
weighted bipartite graph between $U$, the set of nodes denoting the
retrieval sets $\mathcal{D}$, and $V$, the set of nodes representing the sub-topics
$\mathcal{T}$, where the weight of each edge $e_{ij}$ is given by the average
precision of the retrieval set $D_{S_i}$ w.r.t. the sub-topic $T_j$ (step 1 in table
2). Next, we iteratively pick the edge $e_{uv}$ that has the maximum
weight and assign the corresponding retrieval set $D_{S_u}$ to the topic
$T_v$. We remove all other edges that connect to these nodes each
time. The greedy assignment is complete when there are no edges
remaining in the graph. We then sum the assigned edge weights and
divide the sum by the larger of the initial number of retrieval sets or
sub-topics to obtain the evaluation measure, which we call
Maximum Matching AMAP (MM-AMAP). The larger value is chosen
to penalize the system if it provides fewer or more suggestions
than the number of actual sub-topics.

Figure 2: Maximum Matching AMAP algorithm: the edge thickness
represents its weight
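The greedy matching of Section 2.1 can be sketched as follows. This is an illustrative Python version under stated assumptions: the function name `mm_amap` and the weight-matrix input are hypothetical, the edge weights are assumed to be precomputed, and ties between equal weights are broken by input order, which the text does not specify.

```python
from typing import List


def mm_amap(weights: List[List[float]]) -> float:
    """Greedy maximum-matching AMAP.

    weights[i][j] is the precomputed score of retrieval set i against
    sub-topic j (m rows = retrieval sets, n columns = sub-topics).
    """
    m = len(weights)
    n = len(weights[0]) if m else 0
    if m == 0 or n == 0:
        return 0.0
    # All edges of the complete bipartite graph, heaviest first.
    edges = sorted(
        ((weights[i][j], i, j) for i in range(m) for j in range(n)),
        key=lambda e: e[0],
        reverse=True,
    )
    used_sets, used_topics = set(), set()
    total = 0.0
    for w, i, j in edges:
        # An edge touching an already assigned node counts as removed.
        if i in used_sets or j in used_topics:
            continue
        used_sets.add(i)
        used_topics.add(j)
        total += w
    # Dividing by the larger side penalizes too few or too many suggestions.
    return total / max(m, n)


# Hypothetical example: three suggestions, two true sub-topics.
example = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.2]]
result = mm_amap(example)
```

Sorting the edges once and skipping those whose endpoints are already assigned selects the same edges as repeatedly taking the maximum-weight remaining edge. In the example, the third suggestion stays unmatched, so the sum of two weights is still divided by three.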

2.2  Distinctness  and  MAP  based  F1  (DMAP-F1)


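1.  Define a fully connected weighted bipartite graph $(U, V, E)$ where
    $U = \{D_{S_1}, \dots, D_{S_m}\}$ and $V = \mathcal{T} = \{T_1, \dots, T_n\}$ and
    $E = \{e_{ij} = (D_{S_i}, T_j) \mid D_{S_i} \in U,\ T_j \in V,\ W(e_{ij}) = \mathrm{MAP}(D_{S_i} \mid T_j)\}$.

2.  MM-AMAP $\leftarrow 0$

3.  while $E \neq \{\}$

4.      $e_{uv} = \arg\max_{e_{ij} \in E} W(e_{ij})$

5.      MM-AMAP $\leftarrow$ MM-AMAP $+\ W(e_{uv})$

6.      $E \leftarrow E - \{e_{uj} \mid T_j \in V\} - \{e_{iv} \mid D_{S_i} \in U\}$

7.  MM-AMAP $\leftarrow$ MM-AMAP $/ \max(m, n)$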


Table  2:  A  greedy  Maximum  matching  algorithm  for  evaluating  QR  suggestions 


Most of the standard TREC collections do not have any
judgments for sub-topics underlying the queries. Hence MM-AMAP
cannot be used as an evaluation technique on these collections. In
this subsection, we present a new measure that relaxes the requirement
for sub-topic judgments. This new measure is based on the
following premise: since the retrieved sets of documents $\mathcal{D} = \{D_{S_1}, \dots, D_{S_m}\}$
corresponding to the QR suggestions $\mathcal{S} = \{S_1, \dots, S_m\}$ are
expected to represent distinct sub-topics, they should have as little
overlap between them as possible. Additionally, since all the
sub-topics are part of the query’s main topic, each retrieval set should also
capture the main topic as much as possible. This premise allows
us to evaluate the quality of QR suggestions as follows:

1.  For each retrieval set $D_{S_i}$, compute $\mathrm{MAP}(D_{S_i} \mid T)$, the MAP
w.r.t. the query’s main topic $T$.

2.  Compute

$$\mathrm{AMAP} = \frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \mathrm{MAP}(D_{S_i} \mid T)$$

where $|\mathcal{S}|$ is the total number of QR suggestions.

3.  Compute the distinctness ratio at rank $R$, $\mathrm{DR}(R)$, between the
retrieval sets as the ratio of the number of documents that
occur in the top $R$ documents of exactly one of the retrieval sets
to the total number of unique documents in all the retrieval
sets put together. The mean distinctness ratio MDR is then
given by:

$$\mathrm{MDR} = \frac{1}{10} \sum_{i=1}^{10} \mathrm{DR}(100 \cdot i) \tag{2}$$

An example computation of $\mathrm{DR}(R)$ is shown in figure 3.
MDR averages the distinctness ratio at the top 100 documents
through the top 1000 documents in steps of 100 documents. This
measure is inspired by mean average precision (MAP) and
accounts for the fact that maintaining distinctness at the top
of the ranked lists is more important than at the bottom.

4.  Return

$$\text{DMAP-F1} = \frac{2 \times \mathrm{AMAP} \times \mathrm{MDR}}{\mathrm{AMAP} + \mathrm{MDR}}$$

Figure 3: An example computation of distinctness ratio.
Distinctness ratio at rank 2 = #(B,E) / #(A,B,E) = 0.67;
distinctness ratio at rank 4 = #(E,F,C,D) / #(A,B,C,D,E,F).


Thus DMAP-F1 captures the retrieval effectiveness of each retrieval set
w.r.t. the main topic $T$ as well as the distinctness of the retrieval
sets w.r.t. one another.
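The distinctness computation and the final F1 combination can be sketched in Python as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the denominator of DR(R) is taken to be the number of unique documents within the top R of all lists (which matches the counts reported for figure 3), and the two ranked lists are made up so that they reproduce those counts.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def distinctness_ratio(ranked_lists: Sequence[List[str]], rank: int) -> float:
    """DR(R): docs in the top R of exactly one list / unique docs in all top-R lists."""
    counts = Counter()
    for ranked in ranked_lists:
        counts.update(set(ranked[:rank]))  # count each document once per list
    if not counts:
        return 0.0
    exactly_one = sum(1 for c in counts.values() if c == 1)
    return exactly_one / len(counts)


def mean_distinctness_ratio(
    ranked_lists: Sequence[List[str]],
    ranks: Iterable[int] = range(100, 1001, 100),
) -> float:
    """Equation (2): average of DR at ranks 100, 200, ..., 1000 by default."""
    rank_list = list(ranks)
    return sum(distinctness_ratio(ranked_lists, r) for r in rank_list) / len(rank_list)


def dmap_f1(amap: float, mdr: float) -> float:
    """Harmonic mean (F1) of AMAP and MDR."""
    if amap + mdr == 0:
        return 0.0
    return 2 * amap * mdr / (amap + mdr)


# Hypothetical rankings chosen to reproduce the counts in figure 3.
lists = [["A", "B", "C", "D"], ["A", "E", "B", "F"]]
dr2 = distinctness_ratio(lists, 2)  # B, E are unique among A, B, E
dr4 = distinctness_ratio(lists, 4)  # C, D, E, F are unique among six documents
```

Because the two components are combined by a harmonic mean, a system scores well only when its retrieval sets are both on-topic and mutually distinct; a high value on one component cannot compensate for a near-zero value on the other.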

Note that distinctness of retrieval sets does not necessarily
guarantee that each retrieval set captures a unique sub-topic. This
situation is illustrated in the set representation of retrieval sets $D_{S_1}$
and $D_{S_2}$ and sub-topics $T_1$ and $T_2$ in figure 4. In this example
scenario, although the retrieval sets are distinct (no overlap) and
together cover the query’s main topic $T$, they fail to capture the
exact sub-topic structure $T_1$ and $T_2$ of the main topic. This is clearly
a shortcoming of the evaluation metric. However, in our
experiments, we demonstrate that it works well empirically, as shown by
the strong correlation between MM-AMAP and DMAP-F1.


Figure 4: Potential problem in the DMAP-F1 evaluation: the
retrieval sets corresponding to QR suggestions may capture clusters
different from the actual topical structure, and this may never be
noticed by the evaluation


3.  TECHNIQUES 

Several techniques have been proposed to generate QR
suggestions automatically. The early work on QR suggestions can be
traced back to Anick and Tipirneni [2], who use as QR suggestions
the terms with which the query words occur in certain syntactic
constructions such as “adjective, noun, noun”, etc.
Other techniques include suggesting hyponyms (e.g.: birds
of prey / falcons), morphological variants (e.g.: norse myth / norse
myths), acronyms (e.g.: USA / United States of America), etc. [1].

In a work that is similar to the technique of QR suggestions,
Sanderson and Croft [6] came up with a technique based on
subsumption relationships between terms to derive a concept hierarchy
for each query. Lawrie’s work on hierarchical summarization [5]
also comes quite close to our objective here. She provided a
language modeling based framework to choose good topic words and
then create a hierarchy that provides a summary of the underlying
collection of documents.

In this work, all the techniques we considered in our experiments
extract terms from the top ranking documents from the initial
retrieval based on the user’s original query. Our term selection is
primarily based on statistical term weighting and ignores all linguistic
information.

The main reason for considering simple techniques is to
provide a sanity check for our evaluation measures. The techniques we
present below are very intuitive, and each technique overcomes
potential flaws in the earlier techniques in the order presented below.
Thus we expect intuitively sounder techniques to perform better on
our evaluation measures. This could serve as an informal
verification of the soundness of the evaluation measures. We would like to
emphasize that the main contributions of this paper are the
evaluation measures and not the techniques. In the future we would like to
experiment with other existing techniques for QR suggestion using
our new evaluation measures.


1.  TF-IDF: In this technique, we sort the terms in the top 1000
documents retrieved for the original query by their TF-IDF weights
given by:

where $N_{1000}(t)$ is the number of the top 1000 documents in
which the term $t$ occurs, $N(t)$ is the total number
of documents in the collection that contain $t$, $TF_{Raw}(t, D)$
is the number of times $t$ occurs in the given document $D$,
and $N$ is the collection size.

Given parameters $n_t$, the number of terms in each QR
suggestion, $n_s$, the number of suggestions, and the sorted array of
terms $tvec$, the set of query refinement suggestions is given
by: $S_i = \{tvec[(i-1) \cdot n_t + 1], \dots, tvec[i \cdot n_t]\}$. In other
words, the top $n_t$ terms are used as the first QR suggestion,
the next $n_t$ terms are used for the next QR suggestion and so
on.


2.  C-TF-IDF: The above technique makes no attempt to
capture the sub-topic structure of the query’s topic in generating
feedback suggestions. This technique rectifies the drawback
by first clustering the top 1000 documents. We use a simple
online clustering algorithm that determines cluster
membership using TF-IDF weighted cosine similarity and a threshold
$\tau$, where the clustering is done in the rank order of the
documents. These clusters are assumed to represent the sub-topic
structure of the query’s topic. Given the parameter $n_t$ as
above, the $n_t$ top ranking TF-IDF weighted terms are extracted
from each cluster as a QR suggestion. Thus there are as many
QR suggestions as there are clusters, which in turn
depends on the cluster threshold value $\tau$.

Note that we used online clustering because it is one of the
most computationally efficient clustering techniques and is
therefore a suitable technique in this scenario, where response
time to the user is expected to be as low as possible.
Additionally, threshold based clustering allows a different
number of clusters for different queries (topics) depending on the
inter-document similarity of the topic, providing higher
flexibility than a clustering algorithm that fixes the number of
clusters a priori.

3.  C-TF-IDF-ICF: This technique aims to improve the
distinctness of the retrieval sets of the QR suggestions by adding
an extra weight to terms, which we call the Inverse Cluster
Frequency weight, as shown below:

$$\text{C-TF-IDF-ICF}(t) = \text{TF-IDF}(t) \times \log \frac{N_c}{N_c(t)} \tag{4}$$

where $N_c$ is the number of clusters and $N_c(t)$ is the number
of clusters the term occurs in.

Similar to the IDF weight, it down-weights terms that occur
in all the clusters and prefers terms that are unique to the
given cluster.

4.  C-TF-IDF-ICF-RW: The previous algorithms do not
differentiate between terms that occur in documents at the top of
the ranked list and those that occur at the bottom of the ranked
list. This algorithm is the same as the last one except that it
adds an extra rank weight (RW), equal to the okapi score of
the document in the original retrieval, that scores terms from
high ranking documents higher than the ones from low
ranking ones. In other words, the new weight of a term in a
document, $\text{TF-RW}(t, D)$, is computed from the $\mathrm{TF}(t, D)$ formula
in equation 3 as follows:

$$\text{TF-RW}(t, D) = \mathrm{TF}(t, D) \times \text{okapi-score}(D, Q) \tag{5}$$

The objective is to eliminate terms that may have high
TF-IDF weights but may be unrelated to the query’s topic.

5.  Upper-bound: In this case, each query is provided with
exactly as many QR suggestions as the number of actual
sub-topics. In addition, the sub-topic descriptions provided by
TDT annotators are used as QR suggestions. This is clearly
an artificial scenario, but provides an estimate of the
performance of the best possible system.
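The mechanics of techniques 1 to 3 above can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not fix: documents are represented as hypothetical sparse term-weight dictionaries, a document is compared against the running sum of each cluster's vectors (the comparison target of the online clustering is not specified here), and the ICF factor is taken to be log(N_c / N_c(t)) by analogy with IDF. All function names are illustrative.

```python
import math
from typing import Dict, List


def chunk_suggestions(sorted_terms: List[str], n_t: int, n_s: int) -> List[List[str]]:
    """Technique 1: consecutive groups of n_t terms from the weight-sorted list."""
    return [sorted_terms[i * n_t:(i + 1) * n_t] for i in range(n_s)]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def online_cluster(docs: List[Dict[str, float]], threshold: float) -> List[List[int]]:
    """Technique 2: one pass in rank order; join the best cluster or open a new one."""
    clusters: List[List[int]] = []
    centroids: List[Dict[str, float]] = []
    for idx, vec in enumerate(docs):
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            clusters[best].append(idx)
            for t, w in vec.items():  # centroid kept as an unnormalized sum
                centroids[best][t] = centroids[best].get(t, 0.0) + w
        else:
            clusters.append([idx])
            centroids.append(dict(vec))
    return clusters


def icf(term: str, cluster_terms: List[set]) -> float:
    """Technique 3 (assumed form): log(N_c / N_c(t)); zero if the term is everywhere."""
    n_c_t = sum(1 for terms in cluster_terms if term in terms)
    return math.log(len(cluster_terms) / n_c_t) if n_c_t else 0.0


# Hypothetical toy inputs.
groups = chunk_suggestions(["a", "b", "c", "d", "e"], n_t=2, n_s=2)
clusters = online_cluster([{"x": 1.0}, {"x": 1.0, "y": 0.1}, {"z": 1.0}], 0.5)
```

A lower threshold merges more documents into fewer clusters and therefore yields fewer suggestions, which is how the number of suggestions adapts to each query in the clustering-based techniques.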

4.  EXPERIMENTS 
4.1  Data 

The standard TREC data collections do not contain judgments
for sub-topics. The Topic Detection and Tracking (TDT) corpus
(http://www.nist.gov/speech/tests/tdt/), on the
other hand, has this desirable property of a two-level topic structure,
and hence we used it in our experiments.

The TDT corpora contain news stories in multiple media
such as audio, video and news wire, from multiple sources such as
CNN, New York Times, ABC News, etc., and in multiple languages
such as English, Mandarin and Arabic. When the source is
non-English, machine translation output to English is available. When
the source is audio or video, manual transcription feeds or
automatic speech recognition output are used. We used the manually
transcribed, machine translated sections of the TDT3 and TDT4
corpora, which contain 101,765 and 98,245 documents respectively.

The top level of the two-level hierarchical topical structure of
the TDT corpus consists of Rules of Interpretation (ROI) categories,
which are broad categories like “Acts of War”, “Celebrity news”,
“Elections”, etc. Under each ROI category there are several
topics, and labeled documents on these topics are made available. For
example, under the ROI category “Acts of violence and war”, there
are topics such as “Bogota Plane hijacking”, “Palestinian child killed
in cross-fire”, “Car bombings in Spain”, etc. We excluded the
“Miscellaneous” ROI category from each corpus since it contains
unrelated topics and is thus not a good representation of a topical
structure.

We considered each ROI category as our query’s main topic and
the topics under each ROI as our sub-topics under the query.
Henceforth, we will refer to each ROI category as an ROI topic and the TDT
topics as our sub-topics. When a query is issued on the ROI topic
“Acts of violence and war”, the QR suggestions are expected to
reveal the sub-topics underneath it such as “Bogota Plane
hijacking”, “Palestinian child killed in cross-fire”, etc.

In all, we have 10 ROIs from the TDT3 corpus and 11 ROIs from
the TDT4 corpus. Under the ROIs we chose to use, there are 80
sub-topics in TDT3 and 65 sub-topics in TDT4 that have judged
documents. The number of sub-topics per ROI topic ranges
from 2 to 21 in the TDT3 corpus and from 2 to 17 in the TDT4 corpus.

The TDT corpus is primarily built for event based organization
of news stories. Although it contains topic judgments which we
can use for evaluation, it does not contain queries for the ROIs.
The ROI titles such as “Acts of violence and war” are too general to
be used as queries since relevant documents may not contain those
exact words; hence we generated artificial queries for each ROI by
extracting the top 8 TF-IDF terms from the union of judged relevant
documents of all sub-topics under each ROI topic.


4.2  Results

We indexed the collection using the Lemur software. We performed
stopping using a standard stop-list and stemming using the K-stemmer.
We built Lemur APIs for all our algorithms listed in section 3. We
used the okapi retrieval method for basic retrieval for our
original query as well as the retrieval for QR suggestions (needed for
evaluation). The okapi TF-IDF weight for a query term t in a
document D is given as follows:

For all our algorithms, we did a two-fold cross validation: we
optimized the parameters of each algorithm by maximizing the
objective function (MM-AMAP) on one corpus and its corresponding
set of queries and tested the algorithms on the other corpus and its
corresponding queries, with the parameters set at these optimal
values. We then switched the training and test sets and repeated the
same process. The results of the two test sets are then merged to
obtain 21 data points corresponding to 21 ROI queries.

The results of our experiments are presented in table 3. Results
show consistent trends between the evaluation measures as well as
a strong statistical correlation in terms of the Pearson correlation
coefficient (which ranges between -1 and +1; values > 0 indicate positive
correlation). Also notice that each successive algorithm performs
better than the previous one on both measures, informally
validating our measures since each successive algorithm overcomes the
flaws in the previous one.

5.  CONCLUSIONS

In this paper, we presented two IR based, objective evaluation
measures to estimate the quality of query refinement
suggestions. While the first one, MM-AMAP, relies on the availability of
sub-topic judgments for the query’s main topic, the other measure
eliminates this requirement by estimating the distinctness of the
retrieval sets w.r.t. one another. The correlation between the two
measures is established not only by the Pearson test, but also by the
consistency of results between the two measures.

One of the limitations of the current work is the relatively small
number of queries (21) that we performed the experiments on. This
is mainly due to the non-availability of sub-topic judgments in
standard research collections. The second evaluation metric, DMAP-F1,
addresses precisely this issue and eliminates the need for sub-topic
judgments, allowing us to perform experiments on a larger number
of queries using TREC collections and judgments in the future.

Another important limitation of the current evaluation measures
is that they fail to take into account user experience. A QR
suggestion can be effective in retrieving documents on a sub-topic when
used as a new query, but it is not of much use if the user does not
comprehend it. For example, given a query “information retrieval”,
a QR suggestion such as “spider robot” may be effective in
retrieving documents on the sub-topic “automatic crawling and indexing”
and hence may be rated high by our evaluation metric. But it is
not clear if this suggestion would help the user understand the
sub-topic in question. Hence one of our future plans is to measure the
correlation between our measures and user satisfaction and also to develop
predictors of user satisfaction using objective evaluation metrics.

We also intend to perform more extensive experiments
comparing other published QR suggestion techniques in the future.